This project centers on predicting flight delays from a number of variables, with each entry in the data set being a specific flight from 2008. The arrival delay of a particular flight is the outcome we are interested in, with various factors known prior to arrival serving as predictors. The data includes flights from numerous carriers across a variety of geographical locations.
These are the packages that we load for various model-building, plotting, and organizational tasks.
library(ISLR)
library(corrplot)
library(discrim)
library(corrr)
library(kknn)
library(knitr)
library(MASS)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(ggrepel)
library(ggimage)
library(rpart.plot)
library(vip)
library(vembedr)
library(janitor)
library(randomForest)
library(stringr)
library(dplyr)
library(yardstick)
tidymodels_prefer()
flight_data <- read.csv('/Users/kerouac/Downloads/DelayedFlights.csv')
save(flight_data, file = '~/project_save_files/flight_data.rda')
We read in the csv file and then saved it as an .rda file so that we can load it directly in later runs and save computing time.
load(file ='~/project_save_files/flight_data.rda' )
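One pattern that can make re-runs cheaper is to guard the expensive read behind a cache check, so the raw CSV is only parsed when the cached .rda is missing. A minimal sketch; the helper name `load_or_cache` is ours, not part of the project:

```r
# Sketch: re-read the raw CSV only when a cached .rda does not exist yet.
load_or_cache <- function(csv_path, cache_path) {
  if (file.exists(cache_path)) {
    load(cache_path)                  # restores `flight_data` into this scope
  } else {
    flight_data <- read.csv(csv_path)
    save(flight_data, file = cache_path)
  }
  flight_data
}
```

On the paths above this would be `flight_data <- load_or_cache('/Users/kerouac/Downloads/DelayedFlights.csv', '~/project_save_files/flight_data.rda')`.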
In order to get a clear picture of the data, let’s go over the different variables and a summary of what is represented in the set. The variables and a short description of each are listed below.
X: Unique Entry
Year: 2008 for all flights
Month: 1-12 is January through December
DayOfMonth: 1-31 possible days of the month
DayOfWeek: 1 (Monday) through 7 (Sunday)
DepTime: Actual departure time in local time
CRSDepTime: Scheduled departure time
ArrTime: Actual arrival time in local time
CRSArrTime: Scheduled arrival time in local time
UniqueCarrier: Unique Carrier Code (airline)
FlightNum: Specific Flight Number
TailNum: Airplane Tail Number which is unique for each aircraft
ActualElapsedTime: Actual Elapsed Time in minutes
CRSElapsedTime: Scheduled Elapsed Time in minutes
AirTime: Time in the air, in minutes
ArrDelay: Arrival delay, in minutes
DepDelay: Departure delay, in minutes
Origin: origin IATA airport code
Dest: destination IATA airport code
Distance: Distance of trip in miles
TaxiIn: taxi in time, in minutes
TaxiOut: taxi out time in minutes
Cancelled: was the flight cancelled?
CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted: 1 = yes, 0 = no
CarrierDelay in minutes: Carrier delay is any delay that is the fault of the carrier, such as aircraft damage, cleaning of the aircraft, or inspection.
WeatherDelay: Weather delay is delay caused by hazardous weather
NASDelay in minutes: Delay that is caused by the National Airspace System (NAS), generally operations at a particular airport such as heavy traffic
SecurityDelay in minutes: Security delay is security measures/events causing delay
LateAircraftDelay in minutes: Arrival delay at an airport that is due to the late arrival of the same aircraft at a previous airport.
dim(flight_data)
## [1] 1936758 30
Here we can see what we are working with in the data. We have 1,936,758 observations and 30 variables.
set.seed(1234)
carriers <- flight_data$UniqueCarrier
endeavor_air <- flight_data %>%
filter(UniqueCarrier == '9E')
american_airlines <- flight_data %>%
filter(UniqueCarrier == 'AA')
aloha_airlines <- flight_data %>%
filter(UniqueCarrier == 'AQ')
alaska_airlines <- flight_data %>%
filter(UniqueCarrier == 'AS')
jetblue_airlines <- flight_data %>%
filter(UniqueCarrier == 'B6')
continental_air <- flight_data %>%
filter(UniqueCarrier == 'CO')
delta_airlines<- flight_data %>%
filter(UniqueCarrier == 'DL')
expressjet_air <- flight_data %>%
filter(UniqueCarrier == 'EV')
frontier_airlines <- flight_data %>%
filter(UniqueCarrier == 'F9')
airtran_airlines <- flight_data %>%
filter(UniqueCarrier == 'FL')
hawaiian_airlines <- flight_data %>%
filter(UniqueCarrier == 'HA')
envoy_airlines <- flight_data %>%
filter(UniqueCarrier == 'MQ')
northwest_airlines <- flight_data %>%
filter(UniqueCarrier == 'NW')
psa_airlines <- flight_data %>%
filter(UniqueCarrier == 'OH')
skywest_airlines <- flight_data %>%
filter(UniqueCarrier == 'OO')
united_airlines <- flight_data %>%
filter(UniqueCarrier == 'UA')
us_airways<- flight_data %>%
filter(UniqueCarrier == 'US')
southwest_airlines<- flight_data %>%
filter(UniqueCarrier == 'WN')
jsx_airlines<- flight_data %>%
filter(UniqueCarrier == 'XE')
mesa_airlines<- flight_data %>%
filter(UniqueCarrier == 'YV')
all_airlines <- rbind(sample_n(airtran_airlines, 500),
                      sample_n(alaska_airlines, 500),
                      sample_n(aloha_airlines, 500),
                      sample_n(american_airlines, 500),
                      sample_n(continental_air, 500),
                      sample_n(delta_airlines, 500),
                      sample_n(endeavor_air, 500),
                      sample_n(envoy_airlines, 500),
                      sample_n(expressjet_air, 500),
                      sample_n(frontier_airlines, 500),
                      sample_n(hawaiian_airlines, 500),
                      sample_n(jetblue_airlines, 500),
                      sample_n(jsx_airlines, 500),
                      sample_n(mesa_airlines, 500),
                      sample_n(northwest_airlines, 500),
                      sample_n(psa_airlines, 500),
                      sample_n(skywest_airlines, 500),
                      sample_n(southwest_airlines, 500),
                      sample_n(united_airlines, 500),
                      sample_n(us_airways, 500))
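The twenty filter() calls and the long rbind() above can also be written as a single grouped pipeline, sketched below as a helper (assuming dplyr 1.0+ for slice_sample()). Note that under the same seed this draws a different sample than the explicit rbind(), so it is an alternative, not a drop-in replacement:

```r
# Sketch: 500 flights per carrier drawn in one grouped pipeline.
sample_per_carrier <- function(df, n_per_carrier = 500) {
  df %>%
    group_by(UniqueCarrier) %>%
    slice_sample(n = n_per_carrier) %>%
    ungroup()
}
```

Usage would then be `all_airlines_alt <- sample_per_carrier(flight_data)`.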
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'UA'] <- 'United'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'US'] <- 'US Airways'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'WN'] <- 'Southwest'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'XE'] <- 'JSX'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'AS'] <- 'Alaska'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'OO'] <- 'Skywest'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'FL'] <- 'Airtran'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'DL'] <- 'Delta'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'CO'] <- 'Continental'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'B6'] <- 'JetBlue'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'HA'] <- 'Hawaiian'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'AA'] <- 'American'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == '9E'] <- 'Endeavor'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'EV'] <- 'ExpressJet'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'F9'] <- 'Frontier'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'OH'] <- 'PSA'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'YV'] <- 'Mesa'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'NW'] <- 'Northwest'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'AQ'] <- 'Aloha'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'MQ'] <- 'Envoy'
unique(all_airlines$UniqueCarrier)
## [1] "Airtran" "Alaska" "Aloha" "American" "Continental"
## [6] "Delta" "Endeavor" "Envoy" "ExpressJet" "Frontier"
## [11] "Hawaiian" "JetBlue" "JSX" "Mesa" "Northwest"
## [16] "PSA" "Skywest" "Southwest" "United" "US Airways"
These are the various carriers operating the flights. There are 20 unique carriers.
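The twenty individual replacements above can also be kept in one named lookup vector, which makes the code-to-name mapping easier to audit in a single place. A sketch with the same mapping (the application step is shown as a comment since it modifies all_airlines in place):

```r
# Sketch: carrier-code-to-name mapping as a single lookup vector.
carrier_names <- c(
  '9E' = 'Endeavor',   'AA' = 'American',   'AQ' = 'Aloha',
  'AS' = 'Alaska',     'B6' = 'JetBlue',    'CO' = 'Continental',
  'DL' = 'Delta',      'EV' = 'ExpressJet', 'F9' = 'Frontier',
  'FL' = 'Airtran',    'HA' = 'Hawaiian',   'MQ' = 'Envoy',
  'NW' = 'Northwest',  'OH' = 'PSA',        'OO' = 'Skywest',
  'UA' = 'United',     'US' = 'US Airways', 'WN' = 'Southwest',
  'XE' = 'JSX',        'YV' = 'Mesa'
)
# applied in one step:
# all_airlines$UniqueCarrier <- unname(carrier_names[all_airlines$UniqueCarrier])
```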
unique(all_airlines$Origin)
## [1] "MEM" "FNT" "SEA" "EWR" "ATL" "BWI" "DAB" "PNS" "RSW" "PBI" "BUF" "MLI"
## [13] "MCO" "SJU" "IAD" "MDW" "HPN" "DFW" "LGA" "MKE" "SAN" "IND" "FLL" "ROC"
## [25] "CAK" "PIT" "DTW" "MSY" "PHF" "LAS" "CLT" "LAX" "MIA" "BOS" "DAY" "PHL"
## [37] "SRQ" "MCI" "RDU" "TPA" "JAX" "DEN" "SFO" "STL" "DCA" "BMI" "PHX" "CHS"
## [49] "SWF" "MSP" "SAV" "GPT" "SAT" "RIC" "BET" "ANC" "LGB" "JNU" "AKN" "GEG"
## [61] "FAI" "ONT" "PSP" "PDX" "OTZ" "BRW" "OAK" "KTN" "SJC" "SIT" "ORD" "YAK"
## [73] "DLG" "SCC" "BUR" "SNA" "SMF" "ADQ" "OME" "PSG" "HNL" "KOA" "LIH" "OGG"
## [85] "ITO" "RNO" "MFE" "AUS" "JFK" "ELP" "TUS" "IAH" "SDF" "TUL" "XNA" "OKC"
## [97] "STT" "HSV" "COS" "BNA" "BDL" "ICT" "EGE" "ABQ" "CLE" "HDN" "BHM" "CMH"
## [109] "SLC" "CVG" "ORF" "HOU" "GSO" "JAC" "TVC" "DSM" "JAN" "TLH" "OMA" "BTV"
## [121] "GSP" "LEX" "TYS" "DLH" "LIT" "BGR" "ERI" "GTF" "SBN" "PWM" "PIA" "SHV"
## [133] "ABE" "MGM" "HLN" "GRB" "AZO" "LAN" "MOB" "CWA" "IDA" "FSD" "CAE" "FWA"
## [145] "GRR" "STX" "SGF" "BIS" "CID" "PLN" "ELM" "EVV" "PFN" "MSN" "MYR" "MDT"
## [157] "AVP" "MAF" "ACT" "RST" "GRK" "LBB" "SBA" "SYR" "VPS" "LAW" "FSM" "LRD"
## [169] "TOL" "ABI" "BTR" "CHA" "LSE" "MLU" "DAL" "CMI" "MQT" "SPS" "DBQ" "GGG"
## [181] "ROW" "LFT" "PVD" "SJT" "CLL" "ATW" "ALB" "MHT" "FAY" "CRW" "AVL" "VLD"
## [193] "OAJ" "CSG" "TRI" "EWN" "AGS" "ROA" "MLB" "MCN" "ILM" "EYW" "GNV" "ABY"
## [205] "BQK" "AEX" "LYH" "GTR" "BOI" "PSE" "BQN" "AMA" "HRL" "LCH" "ACK" "MRY"
## [217] "TEX" "YUM" "BFL" "SBP" "SPI" "RKS" "ASE" "FLG" "COD" "GCC" "FAT" "DRO"
## [229] "RAP" "MTJ" "GJT" "GFK" "FCA" "FAR" "MOT" "PSC" "ACV" "MSO" "RDM" "SGU"
## [241] "SMX" "MOD" "BTM" "LNK" "MFR" "BZN" "CPR" "EUG" "RDD" "CLD" "SUN" "CIC"
## [253] "LMT" "BIL" "CEC" "TWF" "ISP" "CRP"
length(unique(all_airlines$Origin))
## [1] 258
These are the various origins by their IATA code. We have 258 unique origins.
unique(all_airlines$Dest)
## [1] "ATL" "TPA" "BWI" "SRQ" "RSW" "IND" "MDW" "ROC" "MSP" "IAD" "MCO" "FNT"
## [13] "MLI" "PHX" "PBI" "LAS" "DCA" "SFO" "MKE" "PNS" "EWR" "FLL" "LGA" "CHS"
## [25] "PHL" "BOS" "MCI" "PHF" "JAX" "HOU" "RIC" "CAK" "CLT" "PIT" "BUF" "RDU"
## [37] "LAX" "DFW" "STL" "DAY" "DTW" "SJU" "HPN" "ICT" "SAN" "BMI" "MSY" "SWF"
## [49] "GPT" "MEM" "DAB" "DEN" "SEA" "MIA" "SAT" "PWM" "CMH" "ANC" "PDX" "YAK"
## [61] "OTZ" "SNA" "PSP" "OAK" "OME" "KTN" "SIT" "SMF" "BRW" "GEG" "FAI" "TUS"
## [73] "WRG" "PSG" "LGB" "AKN" "JNU" "ORD" "ONT" "LIH" "CDV" "SJC" "SCC" "BET"
## [85] "HNL" "BUR" "DLG" "ITO" "RNO" "OGG" "KOA" "COS" "OMA" "MFE" "TUL" "JFK"
## [97] "ELP" "BNA" "ABQ" "SLC" "AUS" "MTJ" "OKC" "HSV" "IAH" "STT" "BDL" "CLE"
## [109] "BTR" "BQN" "CVG" "SAV" "BHM" "JAC" "MLB" "TYS" "CAE" "ORF" "VPS" "ERI"
## [121] "PIA" "CID" "LEX" "XNA" "SCE" "MOB" "SDF" "MBS" "AVP" "SGF" "AVL" "JAN"
## [133] "SBN" "DSM" "FSD" "ROA" "LIT" "GFK" "DLH" "FAR" "TLH" "PFN" "SHV" "ELM"
## [145] "MSN" "SYR" "GRR" "GSP" "GSO" "ABE" "TVC" "BZN" "AZO" "LAN" "BTV" "BGM"
## [157] "ATW" "FWA" "BIS" "MGM" "HLN" "CHA" "LNK" "CMX" "MDT" "MYR" "MQT" "ABI"
## [169] "GGG" "PVD" "AEX" "AMA" "LFT" "LBB" "FSM" "TYR" "SJT" "CLL" "GRK" "TXK"
## [181] "ACT" "LRD" "ALB" "CMI" "GJT" "DAL" "ROW" "LSE" "DBQ" "CWA" "GRB" "RST"
## [193] "SBA" "TOL" "CRP" "SPS" "EVV" "MAF" "FAT" "LAW" "CHO" "TRI" "GNV" "GTR"
## [205] "OAJ" "CRW" "DHN" "EYW" "FAY" "ABY" "MLU" "FLO" "AGS" "HHH" "VLD" "ILM"
## [217] "CSG" "EWN" "MHT" "BOI" "PSE" "BPT" "BRO" "HRL" "BFL" "BGR" "IDA" "MRY"
## [229] "ACK" "DRO" "SBP" "CPR" "YUM" "ASE" "GUC" "EUG" "EGE" "SPI" "MFR" "RAP"
## [241] "BIL" "MSO" "SUN" "CEC" "RDD" "IPL" "CIC" "HDN" "RFD" "RDM" "CLD" "PMD"
## [253] "PSC" "TWF" "MOD" "GTF" "OXR" "SMX" "FCA" "OTH" "ACV" "BTM" "SLE" "ISP"
length(unique(all_airlines$Dest))
## [1] 264
These are the various destinations by their IATA code. There are 264 unique destinations.
all_airlines_full_values <- drop_na(all_airlines)
all_airlines_full_values$Month <- as.factor(all_airlines_full_values$Month)
all_airlines_full_values$DayOfWeek <- as.factor(all_airlines_full_values$DayOfWeek)
Here we drop the NA values from our dataframe because these values do not help with our models. Additionally, it makes sense to change Month and DayOfWeek into factors, since there are a set number of levels for each (12 months in a year, 7 days in a week).
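As an aside, as.factor() produces unordered factors; if the natural ordering of months mattered for an analysis, an explicitly ordered factor could be built instead. A standalone sketch (an ordered factor would also change how step_dummy() encodes the variable later, which is why the unordered version is used in the recipe):

```r
# Sketch: Month as a truly ordered factor, so comparisons respect 1 < ... < 12.
month_ordered <- factor(c(3, 1, 12), levels = 1:12, ordered = TRUE)
month_ordered[1] < month_ordered[3]   # March comes before December
```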
ggplot(all_airlines_full_values)+ geom_boxplot(aes(y = ArrDelay, x = ordered(Month, levels = c(1:12))), colour = 'Blue', fill = 'Orange')+labs(title = "Delays By Month",
subtitle = 'Boxplot',
x = "Month", y = "Delay")
Here is a box plot that shows how delays are distributed throughout the year by month. We can get a general idea of the median and spread of each month from this plot.
by_month_dat <- all_airlines_full_values %>%
group_by(month = Month) %>%
summarize(ArrDelay = mean(ArrDelay))
ggplot(by_month_dat)+
geom_col(aes(x =month, y = ArrDelay), fill ='blue')
This plot shows the mean Arrival Delay by month of the year; there isn’t a great amount of variation from month to month.
by_month_weather <- all_airlines_full_values %>%
group_by(month = Month) %>%
summarize(WeatherDelay = mean(WeatherDelay))
ggplot(by_month_weather)+
geom_col(aes(x =month, y = WeatherDelay), fill = 'forestgreen')
This plot shows the average Weather Delay based upon the month of the year, with some months seeing more of a weather delay than others.
various_origins <- all_airlines_full_values %>%
group_by(Origin)
origin_delay_flights <- various_origins %>%
summarise(avg_delay = mean(ArrDelay),
num_of_flights = length(unique(FlightNum)))
We can examine the number of flights for each origin and the mean delay at that origin.
origins_of_interest <- origin_delay_flights %>%
filter(num_of_flights>50) %>%
arrange(-avg_delay)
ggplot(origins_of_interest)+
geom_col(aes(x = Origin, y = avg_delay), fill = 'red4')+
labs(y = 'Average Delay', title = 'Average Delay of Popular Airports')
We are checking here for the origins with the greatest delays among those with more than 50 entries, since these are among the most popular airports in our dataset and give insight into the distribution of delays. Baltimore-Washington, JFK, and Detroit Metropolitan are the top three airports by average delay.
group_by_uniquecarrier <- all_airlines_full_values %>%
group_by(UniqueCarrier)
carrier_graphing_data <- group_by_uniquecarrier %>%
summarise(avg_delay = mean(ArrDelay))
ggplot((carrier_graphing_data[1:10,])) +
geom_col(aes(x = UniqueCarrier, y = avg_delay), fill = 'maroon3')+
labs(x = 'Unique Carrier', y = "Average Carrier Delay Minutes")
ggplot((carrier_graphing_data[10:20,])) +
geom_col(aes(x = UniqueCarrier, y = avg_delay), fill = 'maroon3')+
labs(x = 'Unique Carrier', y = "Average Carrier Delay Minutes")
The graphs are split in two because the airline names on the x-axis would be too difficult to read on a single plot. JetBlue and Endeavor are the two airlines with the greatest average delay, both over 40 minutes.
set.seed(1234)
flight_split <- initial_split(all_airlines_full_values, prop = 0.80,
strata = ArrDelay)
flight_train <- training(flight_split)
flight_test <- testing(flight_split)
We split the data here so that we have both a training set to train our models and a testing set to verify that the models work well outside the training data. A proportion of 0.8 was used for the split because we have a large number of observations, so the testing set will still be sufficiently large. We stratify on our outcome variable of interest, Arrival Delay.
flight_folds <- vfold_cv(flight_train, v = 10)
The flight folds will be used for cross-validation. Rather than relying on a single training/validation split, each of the ten folds takes a turn as the validation set while the remaining folds are used for training, which gives more stable performance estimates when fitting our models.
flight_recipe <- recipe(ArrDelay~UniqueCarrier+Month+CRSDepTime+CRSArrTime+CRSElapsedTime+Distance+TaxiIn+TaxiOut+DepDelay, data = flight_train)%>%
step_center() %>%
step_scale() %>%
step_corr(threshold = 0.8) %>%
step_dummy(all_nominal_predictors())
Our aim with this recipe is to be able to predict the Arrival Delay of a flight from various factors that come before the arrival. These various factors include: Unique Carrier, Month, Scheduled Departure Time, Scheduled Arrival Time, Scheduled Elapsed Time, Distance of flight, Taxi In and Taxi Out time, and overall Departure Delay.
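Before fitting anything, it can be worth confirming what the recipe actually produces: prep() estimates the preprocessing steps from the training data and bake() applies them. A sketch (note that step_center(), step_scale(), and step_corr() above are given no variable selections, so whether they transform anything is worth verifying here):

```r
# Sketch: inspect the design matrix the recipe produces from the training set.
flight_recipe %>%
  prep(training = flight_train) %>%
  bake(new_data = NULL) %>%   # new_data = NULL returns the processed training data
  glimpse()
```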
With the following linear regression model we add in our recipe, with Arrival Delay as the numeric outcome. We create our workflow and then fit the model to the training data. Let’s see what our results are.
lm_model <- linear_reg() %>%
set_mode('regression') %>%
set_engine("lm")
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(flight_recipe)
linear_fit <- fit(lm_wflow, data = flight_train)
linear_fit %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 38 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.41 1.30 -1.09 2.77e- 1
## 2 CRSDepTime -0.00104 0.000579 -1.79 7.34e- 2
## 3 CRSArrTime -0.000973 0.000531 -1.83 6.68e- 2
## 4 CRSElapsedTime -0.294 0.0138 -21.4 3.70e- 97
## 5 Distance 0.0336 0.00167 20.1 1.75e- 86
## 6 TaxiIn 0.941 0.0309 30.5 1.62e-187
## 7 TaxiOut 0.890 0.0109 81.4 0
## 8 DepDelay 0.976 0.00268 364. 0
## 9 UniqueCarrier_Alaska 0.286 1.09 0.262 7.94e- 1
## 10 UniqueCarrier_Aloha 0.217 1.27 0.171 8.64e- 1
## # … with 28 more rows
predictions_linear <- predict(linear_fit, new_data = flight_train)
flight_metrics <- metric_set(rmse, rsq, mae)
flight_metrics(predictions_linear, truth = flight_train$ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 11.9
## 2 rsq standard 0.965
## 3 mae standard 7.95
After using our model to predict the training data and evaluating our metrics, we can see that our RMSE (root mean squared error) is 11.9, our rsq (R-squared) is 0.965, and our mae (mean absolute error) is 7.95. The errors themselves are not ideal, but the high R-squared indicates that this model explains the variation in the data well.
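For reference, all three reported metrics have simple closed forms. A standalone sketch on a few toy predicted/actual pairs:

```r
# Hand-computed definitions of the three metrics reported above (toy values).
truth <- c(19, 71, 19, 157)
pred  <- c(16.1, 77.0, 29.0, 146.0)

rmse_val <- sqrt(mean((pred - truth)^2))   # root mean squared error
mae_val  <- mean(abs(pred - truth))        # mean absolute error
rsq_val  <- cor(truth, pred)^2             # R-squared as squared correlation

c(rmse = rmse_val, mae = mae_val, rsq = rsq_val)
```

yardstick’s rsq() is the squared correlation between truth and estimate, which is why an rsq near 1 can coexist with a sizeable RMSE.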
ridge_recipe <-
flight_recipe %>%
step_novel(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
ridge_spec <- linear_reg(mixture = 0, penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
ridge_wflow <- workflow() %>%
add_model(ridge_spec) %>%
add_recipe(ridge_recipe)
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
ridge_tune_res <- tune_grid(
ridge_wflow,
resamples = flight_folds,
grid = penalty_grid
)
best_penalty <- select_best(ridge_tune_res, metric = "rmse")
best_penalty
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 0.00001 Preprocessor1_Model01
ridge_final <- finalize_workflow(ridge_wflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = flight_train)
augment(ridge_final_fit, new_data = flight_test) %>%
metrics(truth = ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 12.9
## 2 rsq standard 0.967
## 3 mae standard 8.94
autoplot(ridge_tune_res)
Our ridge regression was tuned by varying the penalty over a range of 10^-5 to 10^5, resampling over our flight folds. The best penalty by cross-validated RMSE was 1e-05, from Preprocessor1_Model01. Fitting the finalized model and evaluating on the test set gave an RMSE of 12.9, an rsq of 0.967, and an mae of 8.94.
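To see what the selected penalty did to the coefficients, the finalized fit can be tidied the same way as the linear model earlier. A sketch:

```r
# Sketch: ridge coefficients at the penalty selected above.
ridge_final_fit %>%
  extract_fit_parsnip() %>%
  tidy()
```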
knn_spec <-
nearest_neighbor(
neighbors = tune(),
mode = "regression") %>%
set_engine("kknn")
knn_wkflow <- workflow() %>%
add_model(knn_spec) %>%
add_recipe(flight_recipe)
knn_parameters<- parameters(knn_spec)
knn_grid <- grid_regular(knn_parameters, levels = 2)
tuned_neighbors <- knn_wkflow %>%
tune_grid(
resamples = flight_folds,
grid = knn_grid)
save(tuned_neighbors, file = '~/project_save_files/tuned_neighbors_model.rda')
load(file = '~/project_save_files/tuned_neighbors_model.rda')
autoplot(tuned_neighbors)
show_best(tuned_neighbors, metric = 'rmse')
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 42.3 10 0.798 Preprocessor1_Model1
## 2 15 rmse standard 43.2 10 1.08 Preprocessor1_Model2
We can see that the RMSE for the tuned K Nearest Neighbors model (42.3 at its best, with 1 neighbor) is far higher than for our other models.
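Only two neighbor values (1 and 15) were tried above, so the grid could be worth widening. A sketch of a finer grid; the range and level count here are illustrative choices, not tuned values:

```r
# Sketch: a finer neighbor grid than the two values tried above.
knn_grid_fine <- grid_regular(neighbors(range = c(1, 30)), levels = 10)
# then re-tune as before:
# knn_wkflow %>% tune_grid(resamples = flight_folds, grid = knn_grid_fine)
```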
For the following models, we will save the results and then load them from their respective files to save time when computing. The analysis will be done after loading the files in.
The random forest model has several important arguments: \(\textbf{mtry}\) is the number of predictors randomly sampled as candidates at each split in the tree models; ours ranged from 2-12. \(\textbf{trees}\) is simply the number of trees contained in the tuned model; our range was 90-160. \(\textbf{min_n}\) is the minimum number of data points in a node/leaf required for it to be split again; our range was 10-150. We will select the values that work best for these parameters using tuning.
r_forest_model <- rand_forest() %>%
set_engine("ranger") %>%
set_mode("regression")
r_forest_wflow <- workflow() %>%
add_model(r_forest_model %>% set_args(mtry = tune(),
trees = tune(),
min_n = tune())) %>%
add_recipe(flight_recipe)
forest_grid <- grid_regular(mtry(range = c(2, 12)),
trees(range = c(90, 160)),
min_n(range = c(10, 150)), levels = 8)
rf_tune_res <- tune_grid(r_forest_wflow, resamples = flight_folds, grid = forest_grid, metrics = metric_set(rsq, rmse))
save(rf_tune_res, file = '~/project_save_files/flight_randf.rda')
save(r_forest_wflow, file = '~/project_save_files/r_forest_wflow.rda')
For our boosted tree, \(\textbf{mtry}\) ranged from 2-20, \(\textbf{trees}\) ranged from 100-190, and \(\textbf{min_n}\) ranged from 10-150. We also tune a learn rate, which in our case ranges from -2 to 0.5 on the log10 scale (roughly 0.01 to 3.16).
boost_spec <- boost_tree(trees = tune(),
mtry = tune(),
min_n = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
boost_wflow <- workflow() %>%
add_recipe(flight_recipe) %>%
add_model(boost_spec)
boost_grid <- grid_regular(mtry(range = c(2, 20)),
trees(range = c(100, 190)),
min_n(range = c(10, 150)), learn_rate(range = c(-2,0.5)))
boost_tune_res <- tune_grid(boost_wflow, resamples = flight_folds, grid = boost_grid, metrics = metric_set(rsq, rmse))
save(boost_tune_res, file ='~/project_save_files/boost_tune_res.rda')
load(file = '/Users/kerouac/project_save_files/boost_tune_res.rda')
load(file = '/Users/kerouac/project_save_files/flight_randf.rda')
load(file = '/Users/kerouac/project_save_files/r_forest_wflow.rda')
Let’s analyze the random forest results.
show_best(rf_tune_res, metric = 'rmse')
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 12 110 10 rmse standard 9.35 10 0.180 Preprocessor1_Model0…
## 2 12 130 10 rmse standard 9.37 10 0.177 Preprocessor1_Model0…
## 3 12 140 10 rmse standard 9.39 10 0.175 Preprocessor1_Model0…
## 4 12 160 10 rmse standard 9.39 10 0.171 Preprocessor1_Model0…
## 5 12 150 10 rmse standard 9.39 10 0.177 Preprocessor1_Model0…
show_best(rf_tune_res, metric = 'rsq')
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 12 110 10 rsq standard 0.718 10 0.0105 Preprocessor1_Model0…
## 2 12 130 10 rsq standard 0.717 10 0.0104 Preprocessor1_Model0…
## 3 12 160 10 rsq standard 0.716 10 0.0101 Preprocessor1_Model0…
## 4 12 140 10 rsq standard 0.716 10 0.0103 Preprocessor1_Model0…
## 5 12 150 10 rsq standard 0.716 10 0.0103 Preprocessor1_Model0…
Our tuned random forest yielded the best model (Preprocessor1_Model024) with an RMSE of 9.35 and an R-squared value of 0.718. This came from an \(\textbf{mtry}\) value of 12, a \(\textbf{trees}\) value of 110, and a \(\textbf{min_n}\) value of 10.
autoplot(rf_tune_res)
We can see the general trend that RMSE decreases as the number of selected predictors increases. Since every model in this forest already uses a fairly large number of trees, the effect of the tree count is less pronounced; with that said, around 110 trees yielded the best results.
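The vip package loaded at the top is a natural fit here: with the best parameters and ranger’s importance recording turned on, variable importance can be plotted. A sketch (the tuned workflow above did not set importance at fit time, so a one-off refit is needed; the parameter values are the best ones reported above):

```r
# Sketch: refit the best forest with importance tracking, then plot it.
rf_vip_model <- rand_forest(mtry = 12, trees = 110, min_n = 10) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

workflow() %>%
  add_model(rf_vip_model) %>%
  add_recipe(flight_recipe) %>%
  fit(flight_train) %>%
  extract_fit_parsnip() %>%
  vip()
```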
Now let’s look at the boosted tree model.
show_best(boost_tune_res, metric = 'rmse')
## # A tibble: 5 × 10
## mtry trees min_n learn_rate .metric .estimator mean n std_err .config
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 11 100 10 0.178 rmse standard 16.5 10 1.96 Preproces…
## 2 11 145 10 0.178 rmse standard 16.6 10 1.97 Preproces…
## 3 11 190 10 0.178 rmse standard 16.7 10 1.99 Preproces…
## 4 20 100 10 0.178 rmse standard 16.7 10 2.25 Preproces…
## 5 20 145 10 0.178 rmse standard 16.9 10 2.28 Preproces…
show_best(boost_tune_res, metric ='rsq')
## # A tibble: 5 × 10
## mtry trees min_n learn_rate .metric .estimator mean n std_err .config
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 11 100 10 0.178 rsq standard 0.934 10 0.0107 Preproces…
## 2 11 145 10 0.178 rsq standard 0.933 10 0.0109 Preproces…
## 3 20 100 10 0.178 rsq standard 0.932 10 0.0135 Preproces…
## 4 11 190 10 0.178 rsq standard 0.932 10 0.0111 Preproces…
## 5 20 190 10 0.178 rsq standard 0.931 10 0.0140 Preproces…
autoplot(boost_tune_res)
From our boosted tree model we can see that an \(\textbf{mtry}\) value of 11, a \(\textbf{trees}\) value of 100, a learn rate of 0.178, and a \(\textbf{min_n}\) value of 10 gave us our best metrics. The best rsq value was 0.934 and the best RMSE was 16.5.
From the various models that we created, the best overall RMSE metrics came from the linear regression model and the random forest model. There might have been a degree of overfitting, so we will look at both. Let’s visualize their performance on the testing set further.
delay_predict <- predict(linear_fit,
new_data = flight_test)
delay_prediction_compare <- delay_predict %>%
bind_cols(Actual = flight_test$ArrDelay)
delay_prediction_compare
## # A tibble: 1,295 × 2
## .pred Actual
## <dbl> <dbl>
## 1 16.1 19
## 2 77.0 71
## 3 29.0 19
## 4 146. 157
## 5 30.2 21
## 6 21.9 24
## 7 20.9 15
## 8 103. 100
## 9 54.4 49
## 10 22.0 15
## # … with 1,285 more rows
Here is a quick look at predicted versus actual values.
ggplot(delay_prediction_compare,
aes(x = .pred,
y = Actual)) +
geom_point() +
xlim(0,400)+
ylim(0,400)+
geom_abline(intercept = 0,
slope = 1,
color = "red3",
size = 1)+
labs(title = 'Linear Regression Model',x = 'Predicted')
Here is a graphical representation of how the predicted values compare to the actual values.
linear_metrics <- metric_set(rmse, rsq, mae)
linear_metrics(delay_predict, truth = flight_test$ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 10.8
## 2 rsq standard 0.971
## 3 mae standard 7.46
Our metrics for the linear model on the testing data were even slightly better than on the training data, which suggests that our model generalizes well. RMSE here is 10.8, rsq is 0.971, and mae is 7.46.
rf_final_wf <- r_forest_wflow %>%
finalize_workflow(select_best(rf_tune_res, metric = "rmse"))
results_of_final_wf <- fit(rf_final_wf, flight_train)
best_forest_predict <- predict(results_of_final_wf,
new_data = flight_test)
delay_prediction_compare_rf <- best_forest_predict %>%
bind_cols(Actual = flight_test$ArrDelay)
delay_prediction_compare_rf
## # A tibble: 1,295 × 2
## .pred Actual
## <dbl> <dbl>
## 1 25.2 19
## 2 85.3 71
## 3 28.4 19
## 4 136. 157
## 5 27.9 21
## 6 21.0 24
## 7 26.0 15
## 8 94.9 100
## 9 51.3 49
## 10 22.3 15
## # … with 1,285 more rows
This tibble shows the predictions versus actual values.
ggplot(delay_prediction_compare_rf,
aes(x = .pred,
y = Actual)) +
geom_point() +
xlim(0,400)+
ylim(0,400)+
geom_abline(intercept = 0,
slope = 1,
color = "red3",
size = 1)+
labs(title = 'Random Forest Model',x = 'Predicted')
This is the same actual-versus-predicted graph, but for the random forest.
rf_metrics <- metric_set(rmse, rsq, mae)
rf_metrics(best_forest_predict, truth = flight_test$ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 19.3
## 2 rsq standard 0.918
## 3 mae standard 9.54
Our RMSE value here is 19.3, which is much higher than it was for the training data. The rsq is 0.918 and the mae is 9.54.
In conclusion, the models that we created all had unique results. Initially, the data seemed somewhat difficult to fit given the large number of observations and variables. After shrinking the data set down and cleaning it, model fitting became much easier. This highlights the importance of becoming familiar with the data prior to implementing models.
Ultimately, both the random forest model and the linear regression model had good metrics during training. We aimed to see how various factors known prior to arrival affect arrival delay, and these two models gave us the best results compared to the others. However, once we fit the models to the testing data, the regression model performed better, indicating that there might have been some overfitting with the random forest.
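Collecting the test-set numbers reported above into one table makes the comparison explicit; the values below are copied from the metric outputs earlier in this report.

```r
# Test-set metrics from the fits above, side by side.
tibble(
  model = c('Linear regression', 'Ridge regression', 'Random forest'),
  rmse  = c(10.8, 12.9, 19.3),
  rsq   = c(0.971, 0.967, 0.918),
  mae   = c(7.46, 8.94, 9.54)
) %>% arrange(rmse)
```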
The idea behind this project was very fascinating, and there are some possible improvements for similar types of projects. Firstly, other variables and data that can predict delays would potentially allow for even more interesting models; a dataset with more obscure predictors would be fascinating to explore. Secondly, perhaps more flexible supervised methods such as neural networks would be effective for this type of data.
That’s all, folks.
